---
id: evals-in-prod
title: Deployment
sidebar_label: Setup Evals in Production
---

In this section, we'll set up CI/CD workflows for your summarization agent, and learn how to add metrics and create spans with test cases in your application for better tracing experience.

## Setup Tracing

`deepeval` offers an `@observe` decorator for you to apply metrics at any point in your LLM app to evaluate any [LLM interaction](https://deepeval.com/docs/evaluation-test-cases#what-is-an-llm-interaction),
this provides full visibility for debugging internal components of your LLM application. We have added these decorators during development of our agent, we will now add metrics and spans for running online evals. [Learn more about tracing here](https://deepeval.com/docs/evaluation-llm-tracing).

Here's how we can add metrics and create spans for our `@observe` decorators in the `MeetingSummarizer` class:

```python {6,27,39,51-53,59,73-75}
import os
import json
from openai import OpenAI
from dotenv import load_dotenv
from deepeval.metrics import GEval
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

load_dotenv()

class MeetingSummarizer:
    def __init__(
        self,
        model: str = "gpt-4",
        summary_system_prompt: str = "",
        action_item_system_prompt: str = "",
    ):
        self.model = model
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.summary_system_prompt = summary_system_prompt or (
            "..." # Use the summary_system_prompt mentioned above
        )
        self.action_item_system_prompt = action_item_system_prompt or (
            "..." # Use the action_item_system_prompt mentioned above
        )

    @observe(type="agent")
    def summarize(
        self,
        transcript: str,
        summary_model: str = "gpt-4o",
        action_item_model: str = "gpt-4-turbo"
    ) -> tuple[str, dict]:
        summary = self.get_summary(transcript, summary_model)
        action_items = self.get_action_items(transcript, action_item_model)

        return summary, action_items

    @observe(metrics=[GEval(...)], name="Summary") # Use the summary_concision metric here
    def get_summary(self, transcript: str, model: str = None) -> str:
        try:
            response = self.client.chat.completions.create(
                model=model or self.model,
                messages=[
                    {"role": "system", "content": self.summary_system_prompt},
                    {"role": "user", "content": transcript}
                ]
            )

            summary = response.choices[0].message.content.strip()
            update_current_span(
                input=transcript, output=summary
            )
            return summary
        except Exception as e:
            print(f"Error generating summary: {e}")
            return f"Error: Could not generate summary due to API issue: {e}"

    @observe(metrics=[GEval(...)], name="Action Items") # Use the action_item_check metric here
    def get_action_items(self, transcript: str, model: str = None) -> dict:
        try:
            response = self.client.chat.completions.create(
                model=model or self.model,
                messages=[
                    {"role": "system", "content": self.action_item_system_prompt},
                    {"role": "user", "content": transcript}
                ]
            )

            action_items = response.choices[0].message.content.strip()
            try:
                action_items = json.loads(action_items)
                update_current_span(
                    input=transcript, actual_output=str(action_items)
                )
                return action_items
            except json.JSONDecodeError:
                return {"error": "Invalid JSON returned from model", "raw_output": action_items}
        except Exception as e:
            print(f"Error generating action items: {e}")
            return {"error": f"API call failed: {e}", "raw_output": ""}
```

## Why Continuous Evaluation

Most summarization agents are built to summarize documents and transcripts, often to improve productivity. This means that the documents to be summarized are ever-growing, and your summarizer needs to be able to keep up with that. That's why continuous testing is essential — your summarizer must remain reliable, even as new types of documents are introduced.

**DeepEval**'s datasets are very useful for continuous evaluations. You can populate datasets with goldens, which contain just the inputs. During evaluation, test cases are generated on-the-fly by calling your LLM application to produce outputs.

In the previous section, we created a `deepeval` dataset. You can now reuse this dataset to continuously evaluate your summarization agent.

## Using Datasets

Here's how you can pull datasets and reuse them to generate test cases:

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="MeetingSummarizer Dataset")
```

## Integrating CI/CD

You can use `pytest` with `assert_test` during your CI/CD to trace and evaluate your summarization agent, here's how you can write the test file to do that:

```python title="test_meeting_summarizer_quality.py" {13}
import pytest
from deepeval.dataset import EvaluationDataset
from meeting_summarizer import MeetingSummarizer # import your summarizer here
from deepeval import assert_test

dataset = EvaluationDataset()
dataset.pull(alias="MeetingSummarizer Dataset")

summarizer = MeetingSummarizer()

@pytest.mark.parametrize("golden", dataset.goldens)
def test_meeting_summarizer_components(golden):
    assert_test(golden=golden, observed_callback=summarizer.summarize)
```

```bash
poetry run deepeval test run test_meeting_summarizer_quality.py
```

Finally, let's integrate this test into GitHub Actions to enable automated quality checks on every push.

```yaml {32-33}
name: Meeting Summarizer DeepEval Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install Dependencies
        run: poetry install --no-root

      - name: Run DeepEval Unit Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} # Add your OPENAI_API_KEY
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }} # Add your CONFIDENT_API_KEY
        run: poetry run deepeval test run test_meeting_summarizer_quality.py
```

And that's it! You now have a **robust, production-ready summarization agent** with automated evaluation integrated into your development workflow.

:::tip Next Steps
Setup [Confident AI](https://deepeval.com/tutorials/tutorial-setup) to track your summarization agent's performance across builds, regressions, and evolving datasets. **It's free to get started.** _(No credit card required)_

Learn more [here](https://www.confident-ai.com).
:::
